Business Problem
This problem was originally proposed by Prof. I-Cheng Yeh of the Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It builds on his 1998 research on predicting the compressive strength of concrete structures.
“Concrete is the most important material in civil engineering.” — Prof. I-Cheng Yeh
Concrete compressive strength is determined not only by the water-cement mixture but also by the other ingredients and by how the mixture is treated. Using this dataset, we aim to find “the perfect recipe”: a model that predicts a concrete’s compressive strength and explains how the ingredient concentrations and the age at testing relate to that strength.
Data Processing
Load Libraries
library(dplyr)
library(caret)
library(tidyr)
library(randomForest)
library(ggplot2)
library(lime)
library(GGally)
library(performance)
library(MLmetrics)
library(lmtest)
library(car)
Read Data
# read data
train <- read.csv("data/data-train.csv")
The observation data consists of the following variables:
- id: ID of each cement mixture
- cement: the amount of cement (kg) in a m3 mixture
- slag: the amount of blast furnace slag (kg) in a m3 mixture
- flyash: the amount of fly ash (kg) in a m3 mixture
- water: the amount of water (kg) in a m3 mixture
- super_plast: the amount of superplasticizer (kg) in a m3 mixture
- coarse_agg: the amount of coarse aggregate (kg) in a m3 mixture
- fine_agg: the amount of fine aggregate (kg) in a m3 mixture
- age: the number of resting days before the compressive strength measurement
- strength: concrete compressive strength, in MPa
To ensure the data is fully prepared, we apply data transformations, scaling, outlier handling, and other statistical strategies where needed. It is best practice to clean and transform the data before using it for analysis and modeling.
Pre-processing
# data structure
glimpse(train)
## Rows: 825
## Columns: 10
## $ id <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…
# check missing value
colSums(is.na(train))
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
# remove duplicate rows (note: the result must be assigned back)
train <- unique(train)
# remove rows containing NA values
train <- train %>% filter(complete.cases(.))
Check data distribution of each predictor
train %>%
select_if(is.numeric) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
The boxplot lets us visually check whether outliers are present. Outliers affect the model we are developing by forcing it to accommodate them: as the name suggests, they lie outside the majority of the data, and including them can reduce the model's resolving power. We can see from the boxplot that some variables, such as age, super_plast, and slag, have noticeable outliers.
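For reference, the boxplot whiskers follow the 1.5 × IQR rule. A small sketch of flagging outliers by that same rule (the helper name `iqr_outliers` is ours, not part of the analysis above):

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the rule boxplot whiskers use
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# e.g. sum(iqr_outliers(train$age)) counts the age outliers seen in the boxplot
```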
Distribution of Each Predictor
train %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'predictor') %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~predictor, scales = 'free_x') +
labs(
title = 'Density graph of each variable',
x = 'variable',
y = 'Frequency'
)
The graph shows that flyash, slag, coarse_agg, fine_agg, and cement are all fairly uniform in shape. This may imply that these fundamental components of concrete are combined freely rather than following a prescribed dosage. Most recipes use between 150 and 200 kg of water per m3. super_plast appears to be either not used at all or used in an amount of around 10 kg. Most recipes rest for 7, 30, 60, or 90–100 days.
Data Transformation
Let’s see the trend of our data for each predictor
train %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'predictor') %>%
ggplot(aes(x = value, y = strength)) +
geom_point() +
geom_smooth(method = 'loess', formula = 'y ~ x') +
facet_wrap(~predictor, scales = 'free_x') +
labs(
title = 'Trends in each variable',
x = 'Variable',
y = 'Values'
)
According to the plots, cement and super_plast show a clear positive correlation with strength, while coarse_agg, fine_agg, flyash, and slag show a minor negative one. age and water show no simple linear relationship: water follows a cyclical pattern and age a curved one. Regression models perform best with linear relationships, so we can try transforming the non-linear variables to make their distributions more linear.
Age as Log(Age)
train %>%
select_if(is.numeric) %>%
select(age, strength) %>%
ggplot(aes(x = log(age), y = strength)) +
geom_point() +
geom_smooth(method = 'loess', formula = 'y ~ x') +
labs(
title = 'Correlation between log(age) and strength',
x = 'log(age)',
y = 'strength'
)
We see that the relationship is now more linear. We'll persist this change to our dataset by transforming age to log(age).
Transform Age to Log(Age)
train_log <- train %>% mutate(age = log(age))
Water as Log(Water)
train %>%
select_if(is.numeric) %>%
select(water, strength) %>%
ggplot(aes(x = log(water), y = strength)) +
geom_point() +
geom_smooth(method = 'loess', formula = 'y ~ x') +
labs(
title = 'Correlation between log(water) and strength',
x = 'log(water)',
y = 'strength'
)
Since the shape is still cyclical, this transform has no effect.
Data Scaling
train_log %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range Before Scaling',
x = 'Variable',
y = 'Value'
) + theme_minimal()
Before we scale train_log, we need to remove non-numeric
column id
# data scaling
train_scale <- train_log %>% select(-id) %>% as.data.frame()
train_scale[,-9] <- scale(train_scale[,-9])
train_scale %>%
pivot_longer(cols = -strength, names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range After Scaling',
x = 'Variable',
y = 'Values'
) + theme_minimal()
We scale the data so that each variable's impact is depicted accurately. By standardizing the data, we give each variable equal weight, which lets us interpret the model's coefficients appropriately.
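As a concrete check, `scale()` performs z-score standardization, (x - mean(x)) / sd(x); a minimal sketch on a toy vector:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)   # z-score by hand
via_scale <- as.numeric(scale(x)) # same standardization via scale()
all.equal(manual, via_scale)
```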
Exploratory Data Analysis
Correlation
# check correlation
ggcorr(train_scale, hjust = 1, label = TRUE)
The closer a correlation is to 1 or -1, the more closely related the variables are. The correlation matrix above shows the correlation between each pair of variables. In our dataset, super_plast and water have the strongest negative correlation (-0.6), while strength and age have the strongest positive correlation (0.6). age, cement, and super_plast have the most significant positive relationships with strength, indicating that these variables contribute positively and substantially to it. On the other hand, water has the strongest negative correlation with strength.
Handling Outliers
Find outlier value
# Check the outlier after scaling
boxplot.stats(train_scale$strength)$out
boxplot((train_scale$strength))
Remove the outlier after scaling
# remove the outlier after scaling
# train_clean <- train_scale[train_scale$strength < 2.652730 ,]
train_clean <- train_scale[train_scale$strength < 79.40,]
# train_clean <- train_scale[train_scale$strength > -2.472288,]
boxplot(train_clean$strength)
Modeling with one predictor
model_ols <- lm(formula = strength ~ cement, data = train_scale)
model_ols_no_out <- lm(formula = strength ~ cement, data = train_clean)
Plot the difference between two data
plot(strength ~ cement, data=train_scale)
abline(model_ols, col = "red")
abline(model_ols_no_out, col = "green")
High leverage, low influence: because the graph shows that the outliers of the strength variable have high leverage but low influence, we compare the two models by R-squared.
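The "high leverage, low influence" reading can also be checked numerically with `hatvalues()` and `cooks.distance()`; a sketch on synthetic data (not the concrete dataset):

```r
set.seed(1)
x <- c(rnorm(50), 10)                # last point: extreme x, so high leverage
y <- c(2 * x[1:50] + rnorm(50), 20)  # but it lies near the y = 2x trend
fit <- lm(y ~ x)

lev <- hatvalues(fit)       # leverage of each observation
cd  <- cooks.distance(fit)  # influence of each observation
# the extreme point has leverage far above the average (2/n),
# yet a small Cook's distance: it barely moves the fitted line
```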
R-squared
summary(model_ols)$r.squared
## [1] 0.246024
summary(model_ols_no_out)$r.squared
## [1] 0.242262
Since train_scale gives a better R-squared, we decided not to use train_clean.
Data Distribution of Each Predictor
train_scale %>% as_tibble() %>%
pivot_longer(cols = -c(strength), names_to='Predictor') %>%
ggplot(aes(x = Predictor, y = value)) +
geom_jitter(col = 'blue', alpha = 0.2) +
labs(
title = 'Data Distribution of Each Predictor',
x = 'Predictor',
y = 'Values'
) + theme_minimal()
Model Fitting and Evaluation
Data Splitting
We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.
library(rsample)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
index <- sample(nrow(train_scale), nrow(train_scale)*0.8)
data_train <- as.data.frame(train_scale[index,])
data_validation <- as.data.frame(train_scale[-index,])
Check the Data Split
set.seed(123)
control <- trainControl(method = "cv", number = 10)
ca_model <- train(strength ~ ., data = data_train, method = "lm", trControl = control)
ca_model
## Linear Regression
##
## 660 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 595, 595, 593, 593, 595, 593, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.949791 0.8291556 5.431038
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model Fitting
Model with No Predictor
# model without predictors
model_none <- lm(formula = strength ~ 1, data = data_train)
Model with All Predictors
# model with all predictors
model_all <- lm(strength ~ ., data_train)
Variable Selection : Step-Wise Regression Model
We've built model_none, which uses no predictors, and model_all, which uses all variables. Stepwise regression is a method for picking out an optimal model using the Akaike Information Criterion (AIC) as its metric: it optimizes for the lowest AIC, meaning the least information loss. Let's use stepwise regression to pick the important variables. Note that it uses a greedy algorithm that finds a local minimum, so it does not guarantee the best possible model.
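To make the AIC criterion concrete, here is a toy sketch (synthetic data, our own variable names) showing that a genuinely informative predictor lowers the AIC that step() works with, despite the complexity penalty:

```r
set.seed(42)
df <- data.frame(x1 = rnorm(100))
df$y <- 2 * df$x1 + rnorm(100)  # x1 truly drives y

# extractAIC() is what step() uses internally for lm objects
aic_null <- extractAIC(lm(y ~ 1, data = df))[2]
aic_x1   <- extractAIC(lm(y ~ x1, data = df))[2]
# the informative predictor lowers AIC sharply
```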
1. Backward
# stepwise regression: backward elimination
model_backward <- step(object = model_all,
direction = "backward",
trace = FALSE) # suppress step-wise output
2. Forward
model_forward <- step(
object = model_none, # lower bound
direction = "forward",
scope = list(upper = model_all), # upper bound
trace = FALSE) # suppress step-wise output
3. Both
model_both <- step(
object = model_none, # lower bound
direction = "both",
scope = list(upper = model_all), # upper bound
trace = FALSE
)
Model Evaluation
We now compare the candidate models: model_none (no predictors), model_all (all predictors), and the stepwise models. compare_performance() from the performance package reports AIC, BIC, R-squared, and RMSE side by side.
comparison <- compare_performance(model_none, model_all, model_forward, model_both)
as.data.frame(comparison)
## Name Model AIC AIC_wt BIC BIC_wt R2
## 1 model_none lm 5587.899 1.179650e-250 5596.883 2.674447e-243 0.0000000
## 2 model_all lm 4436.943 9.971489e-01 4481.865 3.551574e-01 0.8293458
## 3 model_forward lm 4451.224 7.900568e-04 4487.162 2.513473e-02 0.8245528
## 4 model_both lm 4449.306 2.061075e-03 4480.752 6.197079e-01 0.8245309
## R2_adjusted RMSE Sigma
## 1 0.0000000 16.631325 16.643938
## 2 0.8272486 6.870453 6.917782
## 3 0.8229407 6.966266 7.003505
## 4 0.8231894 6.966700 6.998585
Evaluation Function
eval_recap <- function(truth, estimate){
df_new <- data.frame(truth = truth,
estimate = estimate)
data.frame(RMSE = RMSE(estimate, truth),
MAE = MAE(estimate, truth),
"R-Square" = R2_Score(estimate, truth),
MAPE = MAPE(estimate, truth),
check.names = F
) %>%
mutate(MSE = sqrt(RMSE)) # note: this column is sqrt(RMSE); the true MSE would be RMSE^2
}
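To make the metrics concrete, here is what eval_recap reports, recomputed by hand in base R on a tiny example (toy numbers, no packages needed):

```r
truth    <- c(10, 20, 30)
estimate <- c(12, 18, 33)
rmse <- sqrt(mean((estimate - truth)^2))       # root mean squared error
mae  <- mean(abs(estimate - truth))            # mean absolute error
mape <- mean(abs((estimate - truth) / truth))  # mean absolute percentage error
```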
Model None - Evaluation
# Model None - Evaluation
pred_none_val <- predict(model_none, data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_none_val)
## RMSE MAE R-Square MAPE MSE
## 1 17.01379 13.48017 -0.01144447 0.5953731 4.124777
Model All - Evaluation
pred_all_val <- predict(object = model_all, newdata = data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_all_val)
## RMSE MAE R-Square MAPE MSE
## 1 7.905688 5.945323 0.7816166 0.1917857 2.811706
Model Step-Wise Both - Evaluation
pred_both_val <- predict(object = model_both, newdata = data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_both_val)
## RMSE MAE R-Square MAPE MSE
## 1 8.005901 6.05212 0.7760451 0.1982521 2.82947
As shown above, model_all has the best evaluation score. Next, we check the model assumptions.
Checking Assumptions
Linear models are made with 4 assumptions. Before we carry on, we have to check whether these assumptions hold for our model.
Assumption of linearity
The assumption of linearity states that a linear relationship exists between the predictors and the target variable, so that our model can describe it correctly. A visual way to evaluate this is to plot the residuals of the model.
Visualization of the residual histogram using ggplot2's geom_histogram():
# histogram residual
ggplot(data = as.data.frame(model_all$residuals), aes(x = model_all$residuals)) +
geom_histogram(fill = "#CC0000", color = "orange", bins = 30) +
labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")
Statistical Test with shapiro.test()
Shapiro-Wilk hypothesis test:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
# shapiro test of the residuals
shapiro.test(model_all$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_all$residuals
## W = 0.99677, p-value = 0.2076
check_normality(model_all)
## OK: residuals appear as normally distributed (p = 0.222).
Based on the result, the residuals are normally distributed.
VIF : Independence of Variable
Multicollinearity is a condition in which predictors are strongly correlated with each other. This is undesirable because it indicates redundant predictors in the model; ideally we would keep only one variable out of each strongly related pair. We hope multicollinearity does not occur.
Test the VIF (Variance Inflation Factor) with the vif() function from the car package:
- VIF value > 10: multicollinearity occurs in the model
- VIF value < 10: there is no multicollinearity in the model
vif(model_all)
## cement slag flyash water super_plast coarse_agg
## 7.779963 6.785129 6.297809 6.777976 3.072010 4.873030
## fine_agg age
## 7.150553 1.030438
No VIF value exceeds 10, so there is no multicollinearity in the model.
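For intuition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A base-R sketch on synthetic data (variable names are ours):

```r
set.seed(7)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(200)                 # independent of the others
X <- data.frame(x1, x2, x3)

vif_manual <- function(j, X) {
  # regress column j on the remaining columns; VIF = 1 / (1 - R^2)
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
}

vif_manual(1, X)  # large: x1 is almost a linear function of x2
vif_manual(3, X)  # near 1: x3 is independent
```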
Homoscedasticity
Homoscedasticity assumption states that the error term in the relationship between the predictor and target variables is constant across all values of inputs. This assumption can be checked using the Breusch-Pagan test with hypotheses :
- H0: Value of error is the same across all inputs (homoscedastic)
- H1: Value of error is not the same across all range of inputs (heteroscedastic)
plot(x = model_all$fitted.values, y = model_all$residuals)
abline(h = 0, col = "#FF0000", ylab = 'Residuals', xlab = 'Prediction')
We can test the homoscedasticity of the model using the Breusch-Pagan test.
bptest(model_all)
##
## studentized Breusch-Pagan test
##
## data: model_all
## BP = 74.093, df = 8, p-value = 7.492e-13
Based on the result (p-value < 0.05), the error variance is not constant across inputs: the model is heteroscedastic.
Even though our linear model fails this test, we can still draw conclusions from it. Our model's mean absolute percentage error is a decent 0.198.
coef_all <- model_all$coefficients[-1]
barplot(coef_all, xlab = names(coef_all), main = 'Influence of `Model_all` Predictor', ylab = 'Value')
Model Interpretation and Improvement Ideas
We shouldn't transform data_train further, because we already preprocessed it at the beginning (scaling, log-transforming variables, and examining outliers). We also do not scale the target variable, because we would then have to scale the test predictions back at the end of the research.
One-Way ANOVA
anova_train <- aov(formula = strength ~ ., data = data_train)
summary(anova_train)
## Df Sum Sq Mean Sq F value Pr(>F)
## cement 1 45198 45198 944.470 < 2e-16 ***
## slag 1 13610 13610 284.391 < 2e-16 ***
## flyash 1 14776 14776 308.765 < 2e-16 ***
## water 1 8018 8018 167.546 < 2e-16 ***
## super_plast 1 921 921 19.240 1.34e-05 ***
## coarse_agg 1 487 487 10.171 0.00149 **
## fine_agg 1 316 316 6.599 0.01042 *
## age 1 68077 68077 1422.548 < 2e-16 ***
## Residuals 651 31154 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Polynomial Model
# note: raw = T fits raw (not orthogonal) polynomial terms
model_polym <- lm(strength ~ polym(cement, slag, flyash, water, super_plast, coarse_agg, fine_agg, age, degree = 2, raw = T), data_train)
pred_polym_val <- predict(object = model_polym, newdata = data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_polym_val)
## RMSE MAE R-Square MAPE MSE
## 1 6.36655 4.805496 0.8583722 0.1431404 2.523202
Checking Assumptions of the Polynomial Model
Residuals Autocorrelation
We will check whether the residuals correlate with themselves using the Durbin-Watson test:
- H0: Residuals are not autocorrelated
- H1: Residuals are autocorrelated
We reject H0 when the p-value < 0.05.
dwtest(model_polym)
##
## Durbin-Watson test
##
## data: model_polym
## DW = 2.0129, p-value = 0.5629
## alternative hypothesis: true autocorrelation is greater than 0
Based on the result, the residuals are not autocorrelated.
VIF: Independence of Variables
Because the polynomial terms in our model are no longer independent by construction, the VIF is not meaningful here. This will become clearer as we examine the model in more detail in the following subsection.
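To illustrate why: with raw polynomial terms, x and x² are strongly correlated by construction, whereas poly() with the default raw = FALSE builds an orthogonal (uncorrelated) basis. A quick sketch:

```r
x <- seq(1, 10, length.out = 100)
raw_terms  <- poly(x, degree = 2, raw = TRUE)  # columns x, x^2
orth_terms <- poly(x, degree = 2)              # orthogonal basis (default)

cor(raw_terms[, 1], raw_terms[, 2])   # strongly correlated
cor(orth_terms[, 1], orth_terms[, 2]) # essentially zero by construction
```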
Statistical Test with shapiro.test()
Shapiro-Wilk hypothesis test:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
# shapiro test of the residuals
shapiro.test(model_polym$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_polym$residuals
## W = 0.99168, p-value = 0.0009204
check_normality(model_polym)
## Warning: Non-normality of residuals detected (p < .001).
# histogram residual
ggplot(data = as.data.frame(model_polym$residuals), aes(x = model_polym$residuals)) +
geom_histogram(fill = "pink", color = "black", bins = 30) +
labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")
Based on the result, the residuals are not normally distributed.
Homoscedasticity
Homoscedasticity assumption states that the error term in the relationship between the predictor and target variables is constant across all values of inputs. This assumption can be checked using the Breusch-Pagan test with hypotheses :
- H0: Value of error is the same across all inputs (homoscedastic)
- H1: Value of error is not the same across all range of inputs (heteroscedastic)
We can test the homoscedasticity of the model using the Breusch-Pagan test.
bptest(model_polym)
##
## studentized Breusch-Pagan test
##
## data: model_polym
## BP = 153.77, df = 44, p-value = 4.359e-14
plot(x = model_polym$fitted.values, y = model_polym$residuals)
abline(h = 0, col = "red", ylab = 'Residuals', xlab = 'Prediction')
Based on the result (p-value < 0.05), the error variance is not constant across inputs: the model is heteroscedastic.
Even though this model also fails the test, we can still draw conclusions from it. Its mean absolute percentage error is a decent 0.143.
Random Forest
Create a random forest model as model_rf
set.seed(123)
model_rf <- randomForest(x = data_train %>% select(-strength),
y = data_train$strength,
ntree = 500)
model_rf
##
## Call:
## randomForest(x = data_train %>% select(-strength), y = data_train$strength, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 34.47698
## % Var explained: 87.54
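The "2 variables tried at each split" above is randomForest's regression default, mtry = floor(p/3) = floor(8/3) = 2. The package also provides tuneRF() to search for an mtry that minimizes the out-of-bag error; a hedged sketch on the built-in mtcars data (illustrative settings, not tuned for the concrete data):

```r
library(randomForest)

set.seed(123)
# grid-search mtry around the default, keeping the lowest OOB error
tuned_mtry <- tuneRF(x = mtcars[, -1], y = mtcars$mpg,
                     ntreeTry = 200, stepFactor = 1.5,
                     improve = 0.01, trace = FALSE, plot = FALSE)
tuned_mtry  # matrix of mtry values and their OOB errors
```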
Check the summary and predictor contribution to the target variable. Note that finalModel is a slot of caret train objects; a plain randomForest object has no such slot, hence the NULL below.
model_rf$finalModel
## NULL
varImp(model_rf)
## Overall
## cement 36292.387
## slag 9936.825
## flyash 8554.180
## water 22655.250
## super_plast 17046.985
## coarse_agg 10924.007
## fine_agg 12855.460
## age 52901.006
Model Random Forest - Evaluation
pred_rf_val <- predict(object = model_rf, newdata = data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_rf_val)
## RMSE MAE R-Square MAPE MSE
## 1 6.055747 4.368295 0.8718627 0.1632309 2.460843
Random Forest Variable Importance on the Target Variable
library("tibble")
model_rf$importance %>%
as.data.frame() %>%
arrange(-IncNodePurity) %>%
rownames_to_column("variable") %>%
head(10) %>%
ggplot(aes(IncNodePurity,
reorder(variable, IncNodePurity))
) +
geom_col(fill = "firebrick") +
labs(x = "Importance",
y = NULL,
title = "Random Forest Variable Importance")
The plot above shows how big the influence of each predictor is; the top 3 predictors correlated with strength are age, cement, and water.
Lime Interpretation
library(lime)
set.seed(123)
explainer <- lime(x = data_train %>% select(-strength),
model = model_rf)
model_type.randomForest <- function(x){
return("regression") # for regression problem
}
predict_model.randomForest <- function(x, newdata, type = "response") {
# return prediction value
predict(x, newdata) %>% as.data.frame()
}
# Select only the first 4 observations
selected_data <- data_validation %>%
select(-strength) %>%
slice(1:4)
# Explain the model
set.seed(123)
explanation <- explain(x = selected_data,
explainer = explainer,
kernel_width = 1,
dist_fun = "manhattan",
n_features = 8 # Number of features to explain the model
)
Since we scaled the data at the beginning, we also use the scaled data to visualize model_rf.
Random Forest Visualization and Interpretation
plot_features(explanation = explanation)
Explanation Fit indicates how well LIME explains the model, somewhat like the \(R^2\) (R-squared) of a linear regression. Here the Explanation Fit values are only around 0.50-0.70 (50%-70%), meaning LIME explains our model only partially. Almost all cases reach the 0.50 (50%) standard; only Case 4 falls below it. We can also note that Case 3 has the largest Explanation Fit, while Case 1 has the largest prediction.
Support Vector Machine
library(e1071)
model_svm <- svm(strength ~ ., data = data_train)
pred_svm_val <- predict(object = model_svm, newdata = data_validation)
eval_recap(truth = data_validation$strength,
estimate = pred_svm_val)
## RMSE MAE R-Square MAPE MSE
## 1 5.544694 3.793846 0.8925775 0.1214288 2.354717
The SVR model outperforms every model we made before. However, we will keep both models for further analysis, both for comparison and as examples.
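svm() above runs with its defaults (radial kernel, cost = 1). e1071's tune.svm() can grid-search cost and gamma by cross-validation; a sketch on the built-in mtcars data (the grid values are illustrative, not tuned for the concrete data):

```r
library(e1071)

set.seed(123)
# cross-validated grid search over gamma and cost
tuned <- tune.svm(mpg ~ ., data = mtcars,
                  gamma = c(0.01, 0.1), cost = c(1, 10))
tuned$best.parameters  # best gamma/cost found by CV
best_fit <- tuned$best.model
```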
Lime Interpretation
# create the explanation for the SVR model.
set.seed(123)
explainer_svm <- lime(x = data_train %>% select(-strength),
model = model_svm)
# Create SVR model specification for lime.
model_type.svm <- function(x){
return("regression") # for regression problem
}
predict_model.svm <- function(x, newdata, type = "response") {
# return prediction value
predict(x, newdata) %>% as.data.frame()
}
SVM Visualization and Interpretation
set.seed(123)
explanation_svm <- explain(x = selected_data,
explainer = explainer_svm,
kernel_width = 1,
dist_fun = "manhattan",
feature_select = "auto", # Method of feature selection for lime
n_features = 10 # Number of features to explain the model
)
plot_features(explanation_svm)
As with the random forest, the Explanation Fit indicates how well LIME explains the SVR model. The values again sit around 0.50-0.70 (50%-70%), so LIME explains this model only partially; only Case 4 falls below the 0.50 standard. Case 3 has the largest Explanation Fit, while Case 1 has the largest prediction.
From Case 3, we get the insight that the three predictors with the biggest influence on strength are age, cement, and water. In Case 4, cement, water, and coarse_agg dominate in controlling the concrete strength.
Finding a Better Material Composition
Linear Model
# gather the top 20 mixtures by strength
top_mix <- train_scale %>% arrange(-strength) %>% head(20)
influence <- top_mix %>% pivot_longer(cols = -c(strength), names_to='Predictor')
# overlay the top mixtures on the training data
data_train %>%
pivot_longer(cols = -c(strength), names_to='Predictor') %>%
ggplot(aes(x = Predictor, y = value)) +
geom_jitter(col = 'red', alpha = 0.2) +
geom_jitter(data = influence, aes(x = Predictor, y = value), col = 'blue') +
labs(
title = 'Comparing Top 20 Mixtures vs. Data',
x = 'Predictor',
y = 'Values'
)
Clustering Age into Three Classes
train_class <- train %>% select(-id) %>% mutate(
age = case_when(age < 40 ~ "< 40",
age >= 40 & age <= 80 ~ "40 - 80",
age > 80 ~ "> 80",
))
Age Class on Strength
ggplot(data = train_class, aes(x = age, y = strength)) +
geom_boxplot() +
geom_jitter(aes(color = age), show.legend = F) +
labs(x = "Age",
y = "Strength") +
theme_minimal()
Based on the statistical summary and the visualization of the strength distribution for each age class above, different resting periods affect construction strength differently. Mixtures aged less than 40 days have the lowest median strength, around 30 MPa. The 40-80 day range performs best, with a median strength of about 50 MPa.
Anova Model
anova_class <- aov(formula = strength ~ age, data = train_class)
summary(anova_class)
## Df Sum Sq Mean Sq F value Pr(>F)
## age 2 49007 24504 111.2 <2e-16 ***
## Residuals 822 181204 220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than the significance level (alpha = 0.05), we conclude that there is a significant difference in construction strength between the age classes.
Pairwise Mean Comparison
TukeyHSD(anova_class)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = strength ~ age, data = train_class)
##
## $age
## diff lwr upr p adj
## > 80-< 40 15.850947 12.6941891 19.007704 0.0000000
## 40 - 80-< 40 20.040109 15.6653237 24.414894 0.0000000
## 40 - 80-> 80 4.189162 -0.8168007 9.195125 0.1216492
Except for the comparison between the 40-80 and > 80 classes, the p-values were below the significance level (alpha = 0.05). We can conclude that concrete is stronger after 40 days.
Anova Assumption
Homogeneity of Variance
Hypothesis:
- H0: The categories have homogeneous variance.
- H1: The categories have heterogeneous variance.
leveneTest(strength ~ age, train_class)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 3.5031 0.03056 *
## 822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than the significance level (alpha = 0.05), so we conclude that the variance between categories is not homogeneous: the homogeneity assumption is violated.
Normality of Residuals
Hypothesis:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
shapiro.test(anova_class$residuals)
##
## Shapiro-Wilk normality test
##
## data: anova_class$residuals
## W = 0.97014, p-value = 5.912e-12
Based on the result, the residuals are not normally distributed.
We can conclude that the best resting period for concrete is around 40-80 days. Setting the period above 80 days is still acceptable, because the median of that distribution is not far from the 40-80 day period.
Composing Best Mixture Based on Random Forest and Linear Model Interpretation
# mix and match based on how big the influence of predictors on targeted variable
comp_1 <- top_mix %>% select(-strength) %>% mutate_all(mean) %>% head(1)
comp_2 <- top_mix %>% select(-strength) %>%
mutate(cement = mean(top_mix$cement),
water = min(top_mix$water),
coarse_agg = mean(top_mix$coarse_agg)) %>% head(1)
comp_3 <- top_mix %>% select(-strength) %>%
mutate(cement = mean(top_mix$cement),
slag = max(top_mix$slag),
water = min(top_mix$water)) %>% head(1)
comp_4 <- top_mix %>% select(-strength) %>%
mutate(cement = weighted.mean(top_mix$cement),
flyash = min(top_mix$flyash),
super_plast = mean(top_mix$super_plast),
coarse_agg = mean(top_mix$coarse_agg)) %>% head(1)
# merge the new compositions
composition <- bind_rows(comp_1, comp_2, comp_3, comp_4)
# predict the new compositions with the SVM model
new_comp <- predict(model_svm, composition)
composition <- composition %>% mutate(strength = new_comp)
new_formula <- composition %>% mutate(formula = c('C1', 'C2', 'C3','C4'))
new_formula
## cement slag flyash water super_plast coarse_agg fine_agg
## 1 1.089695 0.7572579 -0.6391721 -0.9236385 0.8343025 -0.1917599 -0.3567070
## 2 1.089695 1.3450618 -0.8457888 -1.7208632 2.6336581 -0.1917599 -0.2447674
## 3 1.089695 2.4343832 -0.8457888 -1.7208632 2.6336581 -0.3666832 -0.2447674
## 4 1.089695 1.3450618 -0.8457888 -1.6641549 0.8343025 -0.1917599 -0.2447674
## age strength formula
## 1 0.7417364 75.05045 C1
## 2 1.1320080 78.12443 C2
## 3 1.1320080 76.77664 C3
## 4 1.1320080 81.63057 C4
train_scale %>% arrange(-strength) %>% head(1) %>%
mutate(id = "C4",
cement = new_formula[4,'cement'][[1]],
flyash = new_formula[4,'flyash'][[1]],
super_plast = new_formula[4,'super_plast'][[1]],
coarse_agg = new_formula[4,'coarse_agg'][[1]],
strength = new_formula[4,'strength'][[1]]
)
## cement slag flyash water super_plast coarse_agg fine_agg
## 1 1.089695 1.345062 -0.8457888 -1.664155 0.8343025 -0.1917599 -0.2447674
## age strength id
## 1 1.132008 81.63057 C4
Conclusion
In this research project, we examined various concrete formulations with different strengths and developed a model that fits the available information. Using that model as a framework, we composed fresh formulations and predicted their strength.
Throughout this project, we employed a Support Vector Machine/Regression model. Compared to a standard regression, it describes the data better, and despite being more complicated it remains interpretable. The prediction model using model_svm obtained an MAE of 3.79 and an R-squared of 89 percent on the validation dataset, and an MAE of 5.62 and an R-squared of 81 percent on the test dataset.
The methodology adopted in this project can be used for other issues wherever we want to optimize a result. This method can resolve optimization issues, including improving a food product’s flavor, consistency, and texture or determining the ideal chemical mixture to produce a specific flavor and aroma or fragrance. Regression models offer a simple approach to get insight and identify possibilities when used on the relevant issue.
Submission
test <- read.csv("data/data-test.csv")
# note: ideally reuse the training set's centers/scales here rather than re-estimating them from the test set
test[,-c(1,10)] <- scale(test[,-c(1,10)])
test_clean <- test %>% select(-c(id, strength))
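A safer pattern than re-scaling the test set with its own statistics is to reuse the training set's centers and scales, which scale() stores as attributes; a minimal sketch with toy data:

```r
train_x <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
sc <- scale(train_x)  # keeps means/sds as "scaled:center"/"scaled:scale"

new_x <- data.frame(a = c(5, 6), b = c(50, 60))
new_scaled <- scale(new_x,
                    center = attr(sc, "scaled:center"),
                    scale  = attr(sc, "scaled:scale"))
```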
comparison <- compare_performance(model_none, model_all, model_both, model_polym)
as.data.frame(comparison)
## Name Model AIC AIC_wt BIC BIC_wt R2
## 1 model_none lm 5587.899 4.739201e-305 5596.883 3.951068e-262 0.0000000
## 2 model_all lm 4436.943 4.006008e-55 4481.865 5.246883e-20 0.8293458
## 3 model_both lm 4449.306 8.280290e-58 4480.752 9.155193e-20 0.8245309
## 4 model_polym lm 4186.434 1.000000e+00 4393.077 1.000000e+00 0.8953114
## R2_adjusted RMSE Sigma
## 1 0.0000000 16.631325 16.643938
## 2 0.8272486 6.870453 6.917782
## 3 0.8231894 6.966700 6.998585
## 4 0.8878215 5.381167 5.574563
# predict target using your model
pred_test <- predict(object = model_svm, newdata = test_clean)
# Create submission data
submission <- data.frame(id = test$id,
strength = pred_test
)
# save data
write.csv(submission, "submission-sara.csv", row.names = F)
# check first 3 data
head(submission, 3)
## id strength
## 1 S826 40.99496
## 2 S827 30.51722
## 3 S828 41.74776
knitr::include_graphics("data/model-svm.png")